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An improved algorithm is proposed for Monte Carlo methods to study fermion systems in- 
teracting with adiabatical fields. To obtain a weight for each Monte Carlo sample with a fixed 
configuration of adiabatical fields, a series expansion using Chebyshev polynomials is applied. 
By introducing truncations of matrix operations in a systematic and controlled way, it is shown 
that the cpu time is reduced from 0{N i ) to O(N) where N is the system size. Benchmark 
results show that the implementation of the algorithm makes it possible to perform systematic 
investigations of critical phenomena using system-size scalings even for an electronic model in 
three dimensions, within a realistic cpu timescale. 
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§1. Introduction 

There exists a lot of interests in the class of strongly 
correlated electron systems where itinerant electrons are 
coupled to adiabatical fields. An example is the strongly 
coupled electron-lattice system, while the other is the 
double-exchange system where electrons are coupled to 
localized classical spins. Models which represent dilute 
magnetic semiconductors also belong to this class. 

Ground states of such systems may be studied by 
mean-field approaches, since fluctuations are frozen and 
irrelevant. However, in order to study finite temperature 
properties, especially near the critical temperature, it is 
necessary to take into account fluctuations of the fields. 
Natures of the field fluctuations are important to under- 
stand critical phenomena as well as changes of electronic 
properties around critical points. 

Since dynamics of the fluctuations are adiabatically 
slow, electrons can respond to the fluctuating poten- 
tials precisely so that the electronic states are far from 
those without fluctuations. Therefore, in the presence of 
critical fluctuations which are strongly coupled to elec- 
trons, various theoretical methods such as mean-field ap- 
proaches and perturbation theories are invalid. 

Numerical studies provide promising methods to cal- 
culate such systems. Especially, Monte Carlo (MC) 
method is suited for calculation of these models. The 
advantage of this method is that it is possible to obtain 
thermodynamics of the model on a finite size lattice by 
taking partition sums for fluctuating fields which are re- 
placed by stochastic samplings. 1 ** 

However, MC studies suffer from finite-size effects 
since the system size is limited, due to an increase of 
the computational complexities and hence cpu time as 
system sizes are increased. In order to study thermody- 
namic properties of the model properly, it is requisite to 
perform extrapolations to the thermodynamic limit as 



well as finite size scalings. Namely, systematic calcula- 
tions for various lattice sizes which are large enough for 
analyses are necessary. In the conventional algorithm, l > 
the computational complexity scales as 0(N 4 ), where N 
is the system size. Therefore, it is extremely difficult to 
increase the system size. 

In order to overcome the difficulty, improved algo- 
rithms have been proposed to reduce the computational 
complexity for the calculation of the Boltzmann weight. 
The authors have introduced the polynomial expansion 
method (PEM), where the computational complexity is 
reduced to 0(N 3 ). 2 ' 3 ^ Alonso et al. 4 ^ have applied the 
hybrid MC algorithm which makes the computational 
complexity to scale as 0(N 2 ). Using these new meth- 
ods, larger system sizes become available. As an exam- 
ple, critical phenomena of the two-dimensional double- 
exchange model have been investigated using finite-size 
scaling analysis and non-equilibrium relaxation stud- 
ies. 5, 6 ) However, these algorithms are still not sufficient 
enough to study systems which require much larger com- 
putational scales, such as models with more complex in- 
teractions or those in three dimensions. 

In this paper, we present a new algorithm which fur- 
ther reduces the computational complexity of the MC 
calculations to 0(N). In §2, we briefly describe the 
PEM in order to make this article self-contained. In §3, 
we introduce a truncation method to improve the PEM. 
Benchmark results and estimates of the truncation er- 
rors are shown in §4. Sec. 5 is devoted to summary and 
discussions. 

§2. Polynomial Expansion Monte Carlo Method 

2.1 Hamiltonian matrices 

Throughout this paper, we consider a system where 
the Hamiltonian operator is expressed in a quadratic 
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form, 



H(<t>) = ^24H ij (<j>)c j . 



(1) 



Here, Cj (cj) represents a fermion annihilation (creation) 
operator for an index i. Each index represents fermionic 
degrees of freedom, typically a combination of site, or- 
bital and spin. Matrix elements depend on the adia- 
batical fields which are expressed by <p. This means 
that we restrict ourselves to a class of electronic systems 
on lattices coupled to adiabatical fields which give, e.g., 
charge/spin density potentials as well as those coupled 
to orbital degrees of freedom as in Jahn- Teller couplings. 
We assume the absence of electron-electron interactions. 

Later in this paper, simple examples of the Hamilto- 
nian (1) will be given in Eq. (22), where spinless electrons 
are coupled to on-site potential fields, and in Eq. (41) 
where electrons are coupled to localized spins defined on 
each site in such a way that transfer energies are modu- 
lated by configurations of the spins. 

Let us describe the class of systems which are ex- 
pressed by the Hamiltonian (1) more precisely. In usual 
cases, diagonal elements of the Hamiltonian matrices de- 
scribe potential energies, e.g., charge density potentials 
coupled to adiabatical fields. Electron hopping terms, 
as well as coupling to transverse fields give off-diagonal 
matrix elements of H . Within the scheme we do not con- 
sider types of fields which break electron number conser- 
vation, e.g., coupling to singlet superconducting fields 
in a form Ajct^ct^ + h.c, unless there exists a canoni- 
cal transformation which maps the system to a particle 
number conserving system, e.g., — > and 
in the previous example. 

For a fixed configuration of the adiabatical fields, the 
Hamiltonian in Eq. (1) shows a one-body electron sys- 
tem with random potentials. The Hamiltonian matrices 
H have a matrix dimension iVd; m defined by the total 
number of fermionic degrees of freedom, which is pro- 
portional to the system size N. Within this article we 
restrict ourselves to the cases where H are sparse matri- 
ces, namely, the model has short range hoppings in usual 
cases. 

We assume that the adiabatical fields are locally de- 
fined, typically on sites or bonds. Then, the total num- 
ber of the adiabatical fields is proportional to N. We 
also restrict ourselves to the case where interactions be- 
tween the fields and the electrons are short-ranged, i.e., 
the number of matrix elements which are modulated by 
the change of an adiabatical field is O(N ). 

2.2 Boltzmann weight 

The partition function for the Hamiltonian (1) is writ- 
ten as 



Z = Tr c Tr F cxp (-0 \n (<p) - ^e] ) , 



(2) 



where Trc is the trace over adiabatical fields d>, while 
Trp is the grand canonical trace over fermion degrees of 
freedom. Here, (3 is the inverse temperature, fi is the 
chemical potential and N e is the particle-number opera- 
tor. 



In MC approaches, the trace over adiabatical fields is 
replaced by the stochastical sampling of the field config- 
urations <fi whose Boltzmann weight is given by 



P(0) = |cxp [-S off (</>)] . 



(3) 



Here, the effective action S e s is defined by 

S eff (0) = -log(lV F e-^)-^]) 

Wdim 

= 5>M</>)), 



where 



F(x) = — log 



1 + e 



-P{x-n) 



(4) 



(5) 



while £„ is the v-th. eigenvalue of the Hamiltonian matrix 
H for a given configuration of 4>. 

In an importance sampling MC method, probability of 
an update from an old field configuration </> old to a new 
configuration d> new depends on the ratio of the Boltz- 
mann weights which is given by 

- = ^Q. (6) 

P((f>° ld ) 

In a local update of the adiabatical fields, we generate 
0ncw f rom ^oid - n g^jj a wa y ^at only one or a local 

group of adiabatical fields is modified from old but the 
rest are unchanged. The definition of a MC step with 
local updates is that we make a sweep of local updates 
so that all adiabatical field variables are sequentially ex- 
amined for updates. 

One of the method to calculate the Boltzmann weight 
P from Eqs. (3)-(5) is the diagonalization method (DM) 
where £„(</>) are exactly obtained by direct diagonaliza- 
tions of the Hamiltonian matrix H^). 1 ^ The compu- 
tational complexity for each matrix diagonalization to 
obtain all eigenvalues is 0(Adi m 3 )- In a MC step with 
local updates, the number of trials for the field upgrades 
scales as O(N). Since Adim oc N, the total computa- 
tional complexity for a MC step by the DM is (9(iV 4 ). 

2.3 Polynomial expansion method 

An approach to reduce the computational complexity 
in the calculation of P(<p) is to perform a polynomial 
expansion. 2 ) When F{x) in Eq. (5) is expanded by a 
series of polynomials {T m {x)} in a form 



F(x) = f m T m (x), 
we may rewrite Eq. (4) as 

S c fi((f>) — ^ fm Mm 7 



(7) 



(8) 



where /i m is the polynomial moments of the Hamiltonian 
defined by 

ti m =J2 T ™M0)) = Tr T m (H (</>)). (9) 
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Here Tr represents a trace operation for the matrix poly- 
nomials. The values of f m depend on temperature and 
chemical potential but not on the adiabatical fields <p. 
Moments \i m depend on 0, which are calculated through 
trace operations and matrix addition/multiplications of 
H{4>). Therefore, the expansion of the effective action 
SeB by a polynomial series enables us to obtain Boltz- 
mann weights for each update of the adiabatical fields 
through simple matrix operations only. 

Among various choices of polynomials for the series ex- 
pansion in Eq. (7), the Chebyshev polynomials 7,8 ' give 
us an advantage that the expansion coefficients f m de- 
cay quickly for m > 1 in an exponential way. 2 -* The 
Chebyshev polynomials {T m (x)} for m = 0, 1, . . . are re- 
cursively defined by 



To(x) = 1, Ti(x) = x, 
T m {x) = 2xT m -\{x) - T m - 2 (x). 



(10) 



Within the region — 1 < x < 1, the Chebyshev polyno- 
mials show an orthonormal property in a form 

r-1 



Ax 



ttVI 



j-2"m (x)T m ' (x) — ^m^rt 



where 



1, m = 0, 

1/2, m ^ 0. 



(11) 



(12) 



Using this relation, the coefficients in Eq. (7) are ob- 
tained as 



fn 



1 



dx 



<>,„ J_i iry/l — x 2 



F(x)T m (x). (13) 



In Fig. 1 we show some examples for absolute values of 
the coefficients |/ TO | where we see exponential decays. 
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Fig. 1. Absolute values of the coefficient \f m \ for the Chebyshev 
polynomial expansion of S e g, for j3 = 10, 30 and 100 at fi = 0. 



From Eq. (9) we have 



d £ Z?0(e)T m (e) 



AO sin OD^ff) cos mO, 



where 



(14) 



(15) 



is the density of states (DOS) for the eigenvalues of 
H(4>). Here we used an alternative definition of the 
Chebyshev polynomials, 



T m (cos6>) = cos m6. 



(16) 



We note here that, since Chebyshev polynomials T m (x) 
as an orthonormal set are defined in the region — 1 < 
x < 1, the Hamiltonian matrices have to be scaled prop- 
erly so that the eigenvalues satisfy — 1 < £u{4>) < 1 f° r 
v — 1, . . . , iVdi m . From Eq. (14) we see that fj, m is a 
Fourier transform of sin OD^ (cos 8). This means that 
\fi m \ either converge to zero at m ^ 1 if the DOS is 
non-singular, or converge to a constant in the most ex- 
treme cases where there exist macroscopic degeneracies 
in the DOS. Therefore, the series f m (i m in Eq. (8) decays 
exponentially. 

Replacement of the infinite sum over m in Eq. (8) by 
a finite sum up to m = M gives us accurate results if 
an appropriate value of M is chosen. Although such a 
value depends on an asymptotic form of [i m which is not 
predictable a priori, we can always estimate truncation 
errors of PEM due to finite M through comparisons be- 
tween the results by the PEM and the DM. Note that, 
even when it is difficult to perform a product run by the 
DM with large enough MC steps to obtain statistically 
accurate results, it is usually possible to execute a few 
trial MC steps in order to estimate truncation errors. In 
MC runs in practice, truncation errors can be neglected 
if they are sufficiently smaller than the statistical errors. 

2.4 Algorithm 

The algorithm for the PEM using the Chebyshev poly- 
nomials is as follows. We introduce a set of orthonormal 
vectors {e(k)} (k = 1, . . . , N&m), and calculate 

v(k,m) = T m (H)e(k), (17) 

for m > 0. Then, the trace \i m is given by 



where 



k—1 



Vm{k) = (e(fc), v(k,m)) . 



(18) 



(19) 



Here ( , ) represents an inner product. 

Using a recursion relation for the Chebyshev polyno- 
mials in Eq. (10), we obtain v{k,m) as 



v(k,0)=e(k), v(k,l) = Hv(k,0), (20) 



and 



Ui(k, to) = 2 Hij Vj(k, to — 1) — Vi(k, m — 2), (21) 



for i = 1, . . . , Ndim at to > 2, where m) is the i-th 
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element of the vector v(k,m). 

The summation J2j m Eq. (21) can be restricted to j 
with non-zero matrix elements Hij ^ 0. Since we assume 
sparse matrices in Eq. (21), computational complexity 
of these matrix operations are O(N). Therefore, for a 
fixed configuration of <p, the computational complexity 
to obtain n m (k) (0 < m < M) is 0(MN), while that for 
the trace operation is O(N). Then, the complexity of the 
Boltzmann weight calculation scales as 0(MN 2 ). This 
means that, for one MC step with local updates where 
O(N) field variables are manipulated sequentially, the 
computational complexity is 0(MN 3 ). Compared to the 
DM which scales as 0(N 4 ), the PEM is advantageous if 
M < N. 

It has also been shown that the algorithm is suited for 
parallel computations since trace operations are mutu- 
ally independent, 2, 9 ' which further reduce elapsed time 
for calculations. As a result, it become possible to inves- 
tigate models with larger system sizes within a reason- 
able scale of cpu time. Using the PEM, critical phenom- 
ena at finite temperatures for a fermionic model in two 
dimensions are studied for the first time by finite-size 
scaling analysis as well as by non-equilibrium relaxation 
technique. 5,6 ' However, the algorithm still turns out to 
be insufficient to study critical phenomena in three di- 
mensions, since the reduction of the computational com- 
plexities is not large enough. 

§3. Truncated Polynomial Expansion Method 

Now we demonstrate that the calculations for the 
Boltzmann weights can further be improved by intro- 
ducing truncated matrix operations. As an example to 
explain the method, let us first consider a simple model 
which has the Hamiltonian matrix in the form 

{9<t>i i = 3, 
—t i and j are nearest neighbors, 
otherwise. 

(22) 

Here t is the nearest neighbor hopping energy for spinless 
electrons while g is the electron-field coupling constant. 
In this system, local adiabatical field <fi = {4>i} is defined 
on each lattice which is coupled to electrons as an on- 
site potential. The Hamiltonian matrix H(c/>) is a sparse 
matrix with N^im — A. 

3. 1 Truncation of matrix products 

In order to obtain /i m (fc) for m = 0,...,M from 
Eqs. (19)-(21), matrix-vector multiplications throughout 
the Hilbert space are necessary, which give 0(MN) com- 
putational complexity. Here we introduce a truncation 
of the matrix-vector multiplications in order to reduce 
the computational complexity. 

Let us choose ej(fc) = for the orthonormal set in 
Eq. (17). Since Vi(k, 0) is non-zero only at i = k, we 
have Vi(k, 1) ^ only at i = k as well as at nearest 
neighbors of k. Namely, due to the sparse nature of the 
Hamiltonian matrix (22), it is not necessary to calculate 
all the other vector elements. Similarly, if one keeps track 
of a set of indices with Vi(k, m — 1) ^ 0, the calculation 
of vector elements u,(fe,m) can be restricted to limited 



numbers of indices, so that the computational complexity 
is much reduced. 

The matrix-vector product in Eq. (21) can be viewed 
as a transfer-matrix multiplication to a state vector, 
which expresses a diffusive propagation of a wavefunc- 
tion. We start from an initial vector v(k,0) = e(k), 
which expresses an electron state localized at site k. 
Each time the Hamiltonian matrix H is multiplied to 
obtain v(k,m), electrons hop to nearest neighbors. As 
a consequence, the sites with non-zero vector elements 
Vi(k,m) spread out as m increases. In Fig. 2 we give 
a schematic illustration for the propagation steps. The 
process also resembles the diffusion of the probability 
distribution function in a random- walk system. 

k I 

m=0 -ooooootooooot- 




RdM) 

Fig. 2. A sketch for the propagation of vector elements. The 
vertical axis gives the matrix-vector multiplication steps. Circles 
aligned in the horizontal direction represent lattice sites. Filled 
circles show sites with non-zero vector elements, and darkness of 
them schematically illustrates amplitudes of the vector elements. 
The initial vector gives a localized state at site k, while the 
hatched circle at £ symbolizes the site where the Hamiltonian 
matrix element is updated. See also the discussion in §3.2 



Let us define the distance between i-th site and k-th. 
site, denoted as \\i — k\\, by the minimum number of 
hops for an electron to transfer from i-th site to /c-th 
site. On hypercubic lattices, this gives the "Manhattan 
distance". We define the range of propagation Ro{m) by 
the longest distance that an electron can hop by m steps 
of the matrix- vector multiplication in Eq. (21). Since 
sites which are outside of the range of propagation have 
zero vector elements, i.e., 

Vi(k,m) = if \\i - k\\ > Ro(m), (23) 

we have no contribution to the calculations of [i m {k) as 
well as v(k, m + 1) from these sites. This means that we 
may perform our matrix-vector calculation only within 
the neighbors of k which satisfies \\i — k\\ < Ro(m) to 
obtain exact results. In the present model (22) we have 
Ro(m) — m, and the number of sites which contributes 
to the overall calculation of (m < M) is propor- 

tional to M d instead of N when every sites on the lattice 
is taken into account. Here d is the spacial dimension of 
the lattice. By introducing this restriction, the compu- 
tational complexity to obtain fj, m (k) (m < M) is reduced 
from 0{MN) to 0{M d+1 ) without any cost for the com- 
putational accuracies. 



Order N Monte Carlo Algorithm 



5 



We can further reduce the range where calculations are 
restricted, by introducing a threshold e for the vector el- 
ements. When the absolute values of the vector elements 
\vi(k, m)\ are small enough, we can neglect such terms in 
the calculation of fj, m (k) and v(k, m + 1). Let us define 
the range of propagation R e (m) by the longest distance 
| \i — k\ | such that the absolute value of the vector element 
on the i-th site exceeds the threshold, \vi(k,m)\ > e. In 
Fig. 2 we give a schematic illustration. Then, we have 



\vi(k,m)\ < e if | \i - k\ | > R e {m), 



(24) 



and contributions to fi m (k) from sites outside of the 
range are negligible if we take values of e appropriately. 

Since the diffusion length is proportional to the square- 
root of the time-steps in general, R e (m) oc y/m for e > 0. 
Then, by introducing a threshold e and restricting the 
calculation within \\i — k\\ < R e (m), the number of sites 
which contributes to fj, m (k) scales as 0(M d / 2 ). Therefore 
the computational complexity to obtain /j,k (m) (m < M) 
is further reduced to 0(M d / 2+1 ) with the accuracies of 
0(e). (To be more specific, the number of sites neglected 
by this treatment is 0(M d ) for the calculation of fi m 
(m = 0, . . . , M), so the total error for the calculation of 
the Boltzmann weight is 0(M d+1 e).) 

Let us now extend the procedure to general cases 
where an index for fermion degrees of freedom repre- 
sents a combination of site, orbital and spin. Since the 
basis set to define the Hamiltonian matrices (1) can be 
taken arbitrary, we may also consider a system where it 
is not well-defined to consider a geometrical distance be- 
tween indices. Nevertheless, as long as the Hamiltonian 
matrices are sparse, we can generalize the algorithm as 
follows. 

We define a subspace Afo(k, m) as a set of neighboring 
indices of the initial index k which are within the range 
of propagation by m steps, 



Af (k,m) = |J {i | Vi {k,m')^Q}. 



(25) 



m'—0 



Note that, in the previous example, A/o(fc,m) is a set 
of indices within the range of Ro(m) from k. Then, we 
may restrict the matrix product operations within the 
subspace No(k,m), since 



Vi(k,m)=0 if i ^ Ao(fc, m) 



(26) 



so that there is no contribution to /i m (fc) from outside of 
W (fc,m). 

Similarly, we define a subspace Af e (k,m) as a set of 
neighboring indices of k where absolute values of the vec- 
tor elements exceed the threshold e, 

m 

K(k,m)= |J {i | \vi(k,m')\>e}. (27) 

m'=0 

In the previous example, Af e (k,m) roughly corresponds 
to a set of sites within the range of R f (m) from k. By 
making truncations of the matrix product operations 
within the restricted subspace J\f e (k,m), we obtain ap- 
proximate results for /J, m (k) within errors of O(e), since 



\vi(k,m)\<e \ii^M e (k,m), 



(28) 



so that contributions from outside of Af e (k, m) are negli- 
gible. 

3.2 Truncation of trace operations 

In order to obtain S e s directly from Eqs. (8) and (9), 
a trace operation throughout the Hilbert space is neces- 
sary, which gives O(N) computational complexity. Here 
we introduce a truncation of the trace operation in order 
to reduce the computational complexity. 

The probability of the MC update from an old field 
configuration <p old to a new configuration <p ncw , which is 
given by the ratio of the Boltzmann weights in Eq. (6), 
can be calculated from 
P(0 ncw ) 



cxp(-AS' off ), 



(29) 



P(0 old ) 

where A^eft is the difference of the effective action. Us 
ing PEM up to the M-th order, we have 



ASoff = S c s(4> ncw ) - S c s{4>° ld ) 



M 



Y /"» Y A^ m (fc), 

rn— k 



(30) 



where 



A Mm (fc) = (e(fc),^ cw (fc,m)) 

-(e{k),v° m (k,m)). 
Here, v a (k,m) for a = (old, new) is defined by 
v a (k,m)=T m (H( ( f> a ))e(k), 



(31) 



(32) 



and we choose ej(fc) = Sik- Summation over k in Eq. (30) 
is the trace operation in Eq. (9). 

Now we consider a local update of the adiabatical fields 
in the present exemplified model (22). Let us choose a 
site I and try a local update on the site </>° ld — > <^f ew , 
while we have 4>° ld = 4>f ew for i ^ /. In this case, the 
change of the Hamiltonian matrix H(cj) old ) — > H((p uew ) 
exists only at the Z, Z-th matrix element, while we have 
H ij ((p old )= i^(0 now ) elsewhere. 

Let us take a site k which is distant enough from the 
updated site I so that ||fc — l\\ > R n (M) is satisfied, 
and consider the diffusion of the vectors v° ld (A;,m) and 
^"(fc, m). In this case, we have v° ld (k, m) = ^^(k, m) 
and hence Afi m (k) = 0. The reason is as follows. For 
m < M, the region where the state vectors propagate 
does not reach the site /, since Ro(m) < \\k — l\\ is sat- 
isfied. In Fig. 2 we give a schematic illustration. The 
matrix elements that are operated to the vectors during 
the diffusion processes are identical between old and new 
configurations. This makes 



old 



(k, m) 



'(jfe.m) 



(33) 



for i which satisfies \ \i — k\ \ < Ro(m). At the same time, 
by the definition of Ro(m) we have v° ld (k,m) = and 
vf cw (k,m) = for i such that \\i - k\\ > R (m). Then, 
we have vf d (k, m) — vf cw (k, m) in the entire space. 

In other words, we have Afi m (k) ^ for m < M only 
if k is close enough to I so that the propagation from the 
site k reaches the site I within M steps. Therefore, it is 
sufficient to take the summation over k in Eq.(30) only 
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within the vicinity of I which satisfies ||fc — 2|| < Ro(M). 
Namely, the trace operation may be restricted to a sub- 
space which has 0(M d ) sites so that the computational 
complexity for the trace operations is reduced from O(N) 
to 0(M d ). 

Furthermore, we introduce a truncation of trace op- 
erations which gives approximate results with a reduced 
computational complexity. Here we consider a general 
case with sparse Hamiltonian matrices. For a moment 
we restrict ourselves to an update of the adiabatical field 
where the matrix elements of H(<p old ) and H((p ncw ) are 
identical except for the I, l-th element. 

Let us consider an initial vector at k and the propa- 
gation of the vectors for H(<p old ) and H(<p ncw ), which 
gives the set of neighboring indices Af° ld (k,m) and 
7V £ now (/c,m), respectively. If I £ Nf cw {k,M) and I £ 
7V° ld (fc, M) are satisfied, both v° ld {k, m) and i^ ew (fc, m) 
are approximately confined within the subspace where 
matrix elements of the Hamiltonian are identical, and 
do not reach the index I. Then we have 

v° ld (fc,m)~w ncw (fc,m), (34) 

within the error threshold e, and therefore A[i m (k) — 
0(e) is satisfied. In other words, only indices in the vicin- 
ity of / where effective propagations to I occur within 
M steps should be considered for the calculations of 
Afj m (k). 

This means that the trace operation in Eq. (30) can be 
restricted to the vicinity of I, defined by a set of indices 
k where I g Af? cw (k, M) or / g Nf d {k, M) are satisfied. 
Using Eq. (32) and the Hermicity of the Hamiltonian 
matrix polynomials 

T m {H)\ lk = {T m {H)\ kl )\ (35) 

we have v"(k,m) — v^(l,m)* for a = (old, new). 
Namely, if and only if k g N°(l,M), we have I g 
N"(k,M). Therefore, the trace operation can be re- 
stricted to a subspace defined by 

V t (l, M) ee Mf d {l, M) U Af? ew (l, M). (36) 

Due to the diffusive nature of the propagation, the num- 
ber of indices in Nf d {l, M) and Af™ w {l, M) is 0(M d / 2 ) 
on usual lattice systems. The truncated trace operation 
within V e reduces the computational complexity from 
0(N) to 0(M d l 2 ) with errors of 0(e). 

We can extend our algorithm to cases where a local 
update modulates matrix elements for multiple indices. 
An example is the case where fields are coupled to off- 
diagonal matrix elements, e.g., hopping amplitudes. Let 
us consider a case where a local update modulates the 
I, I'-th matrix element. Differences between z7° ld and ■y Ilew 
exist if the propagations of the vectors reach either of 
the indices I or I'. Then, we need to consider a sum of 
vicinities centered at I and I', 

V e tot = V £ (i,M)UV £ (i',M), (37) 

and make trace operations within V £ ot . In a general case 
where a number of matrix elements are modulated, we 
need to consider all the indices associated with modu- 
lated matrix elements. We define C as a set of indices 



where matrix elements are modulated by the update, 

C = {1\ 3 r, H u ,(<j>° ld )^H w (<j> new )}- (38) 
Then, the total vicinity V £ ot is given by 

V e tot = \JV e (l,M), (39) 
lec 

and the trace operations are performed within V* ot , i.e., 

M 

AS cS ~ f™ E A ^m(k). (40) 

m=0 fceV* ot 

As long as MC updates are local, the number of indices 
in C is 0(N ), so that the computational complexity for 
the trace operation will be 0{M d l 2 ). 

3.3 Comparison with previous methods 

Thus we see that the PEM using truncated matrix op- 
erations reduces the total computational complexity for 
one local update from 0(MN 2 ) to 0(M d+1 ), by com- 
bining threshold truncations for both matrix products 
and trace operations. Hereafter we refer to the improved 
method in this section with truncations of matrix opera- 
tions as the truncated PEM, whereas the original method 
described in §2 is called as the full PEM. In Table I we 
summarize the computational complexities for various 
algorithms for comparison. 

Table I. Computational complexities to perform calculations of 
Hm(k), trace operations, calculations of the Boltzmann weight 
ratio through A5 e ff , and a MC step with local updates in total. 
Here, f-PEM and t-PEM stand for the full PEM and the trun- 
cated PEM, respectively. Threshold for the truncated PEM is 
described by e. 



Algorithm 




Trace 


ASeff 


Total 


DM 
f-PEM 
t-PEM 
e = 

6^0 


O(MN) 

0(M d+1 ) 
0(Mi +1 ) 


O(N) 

0{M d ) 
0{Mi) 


0(N 3 ) 
0(MN 2 ) 

0(M 2d+1 ) 
0(Af d+1 ) 


0(iV 4 ) 
0{MN 3 ) 

0(M 2d+1 N) 
Q(M d+1 N) 



Let us emphasize here that the restriction of ma- 
trix operations within J\f (k,M) gives us identical re- 
sults for n m {k) to those obtained by the full PEM, 
with a reduced computational complexity. We also have 
Af e ^o(k,M) = Af (k,M). This implies that the intro- 
duction of the threshold e is a controlled approximation 
in the sense that /J, m (k) are obtained with an arbitrary 
accuracy by an appropriate choice of e, with further re- 
duced computational complexities. 

§4. Algorithm and Implementation 

4-1 Algorithm 

Now we clarify actual algorithms to perform the trun- 
cated PEM. We first show an algorithm for the truncated 
matrix product to obtain Vi(k,m) for m < M: 
i) Determine the truncation threshold for matrix prod- 
ucts e p . Set the initial unit vector Vi(k, 0) = Sik, and 
the initial restricted subspace Af €p = {k}. 
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ii) From given {vt (fc, m — 1)} and J\f £ , perform matrix- 
vector product to calculate {vi(k,m)} within re- 
stricted subspace. Namely, indices j in Eq. (21) are 
restricted to j £ _/V £p , whereas indices i also include 
those generated by propagations due to non-zero off- 
diagonal matrix elements of in Eq. (21). 

iii) The treatment for a newly generated index i is as 
follows: If \vi(k,m)\ > e p , register j as a new mem- 
ber of Af Ep . Otherwise, discard the calculation for 
Vi(k,m) and treat it as zero. 

iv) Repeat steps ii) and iii) M times. 

v) As a byproduct, we obtain Af ep (k, M). 

This procedure automatically gives calculations without 
truncation errors by setting e p = 0. 

An algorithm to perform truncated trace operation for 
an update of the field is as follows: 

i) Determine the truncation threshold for trace opera- 
tions e tr . 

ii) Define C as in Eq. (38). Perform truncated matrix 
product operations to obtain v a (l, M) for I £ C and 
a = (new, old), so that the neighbors of the local- 
updated indices AA e " r (Z,M) are determined. Then, 
V*°* is obtained from Eqs. (36) and (39). 

iii) For k £ V£?, calculate v° u (k,M) and t? lcw (fc,M), 
and obtain A/i(fc). Finally, A5 c ff is evaluated from 
Eq. (40). 

This procedure also gives error- free results if etr = 0. 

Let us note that, the thresholds e p and e tr may be 
chosen independently, and the choice of these thresholds 
has to be justified by making proper estimates for the 
truncation errors. 

Although there exists an arbitrary choice for the basis 
set to express the system in a quadratic form by Hamil- 
tonian matrices (1), one should select that which reduces 
the propagation of the state vector as much as possible in 
order to reduce the computational complexities. A pos- 
sible example is the case where there exists a symmetry 
in a system so that the Hamiltonian matrices may be 
block-diagonalized. The symmetry can be weakly bro- 
ken as long as matrix elements between blocks are small. 
Then, using the basis set which block-diagonalize the 
Hamiltonian matrices, the vector propagations will be 
confined within the blocks. 

Similarly, one may also make an extention to the algo- 
rithm so that the procedure for matrix products starts 
from an initial unit vector Vi(k,0) ^ 5ik, provided the 
number of the non-zero vector elements in v(k,0) is 
O(N ). An appropriate choice of the initial vector, typ- 
ically an expression of local symmetries of the system, 
may cancel the propagation to some extent through in- 
terferences. 

4-2 Benchmark 

In order to demonstrate that the algorithm is success- 
fully implemented, we show benchmark results. As a 
model, we choose the double-exchange (DE) model in the 
strong-coupling limit in three dimensions. The Hamilto- 
nian is given by 



where hoppings of spinless electrons to nearest neighbors 
are coupled to classical spin fields {Si} in a form 



cos — cos -1 + sin — sin -L e -<<t>i-h) , (42) 
2 2 2 2 v ' 



to 

Here 6 and </> are defined by the direction of the localized 
spin S as 



Sf = S sin 6; cos fa , Sf = S sin ( 



Sf = Scos( 



(43) 

with the normalization S = 1, while to is the transfer 
integral between nearest neighbors in the absence of the 
DE interaction. A local update for Si modulates all the 
hopping energies from i-th site to its nearest neighbor 
sites. Then, C for site i contains nearest neighbors of i 
as well as i itself. 

For parameters of the MC benchmark run, we choose 
T/W ~ 0.02 and fi/W ~ where T and /j, are tem- 
perature and the chemical potential, respectively, while 
W = 6£o is the half-bandwidth of the model in the ab- 
sence of the interactions. We typically take MC steps 
as iV s tcp ~ 4000. These parameters are typical ones for 
an investigation of the ferromagnetic transition in the 
model. 3 ) Within this parameter range, M = 16 has been 
shown to be enough for the accuracy of the calculation. 
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Fig. 3. Deviations of the estimated values due to truncations, for 
the square of the ferromagnetic moment Stot 2 on a 4 X 4 X 6 lattice 
system at T/W = 0.015, p/W = 0, M = 16 and 7V stcp = 4000. 
Symbols show estimates by the truncated PEM for various values 
of the truncation thresholds e p and e t r- The solid line in the 
figure shows the estimated value of Stot 2 obtained by the DM, 
while the gray area around the line gives the stochastic error bar 
for the estimate. Error bars of each symbols are roughly equal 
to the error bar of the data by the DM. Therefore, if symbols are 
in the gray area, it is conceivable that overall truncation errors 
are roughly equal to or smaller than stochastic errors. 



In Fig. 3, we show truncation errors for various com- 
binations of etr and e p . We see that the truncation 
errors quickly decreases as the thresholds are lowered, 
and becomes sufficiently small compared to the statisti- 
cal errors. From this result, we choose etr = 10~ 3 and 
e p = 10~ 5 which are satisfactory to give accurate results 
with reduced cpu time in the present case. 
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In usual cases, it is expected that we may take etr 
larger than e p . The threshold et r determines the border 
of the region V*°*. For the indices at the border, contri- 
butions to A/i m (fc) come from propagations from these 
indices to those in C, which take small fractions of the 
whole propagations. On the other hand, the threshold e p 
determines the border of all the propagations, including 
those from the indices near C which give relatively large 
values of A/i m (/c). Therefore, AS'eff is more sensitive to 
the threshold e p . 

Figure 4 shows the cpu time per MC step as a function 
of the system size on a single processor system as well 
as on a parallel computational system with the number 
of processor elements iVpE < 24. From the figure we 
confirm that the cpu time is proportional to the system 
size. We also see that it is possible to implement this 
algorithm for efficient parallel computations. 
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Fig. 4. Benchmark results for 6 X 6 X 6, 8 X 8 X 8, 10 X 10 X 10, 
12 X 12 X 12 and 16 X 16 X 16 sites of the double-exchange model 
in three dimensions, at T/W = 0.02, fi/W = and M = 16, 
while etr = 10 — 3 and e p = 10~ 5 are chosen for the thresholds. 
Computations are performed up to Np^, = 24 processor elements 
of an Athron MP 1500+ cluster system which are connected by 
Myrinet 2000. 



Table II gives benchmark results of the truncated PEM 
for various system sizes, in comparison with the diagonal- 
ization method and the full PEM. Using the truncated 
PEM, it becomes possible to calculate large size systems 
in a realistic time scale. For example, DE model in three 
dimensions up to 20 x 20 x 20 becomes available, which 
is large enough to perform a finite-size scaling analysis 
of the critical phenomena in this model. The result is 
reported elsewhere. 10 - ) 

§5. Summary and Discussions 

Under the condition that the full PEM reduces the 
computational complexity for one MC step with local 
updates to 0(N 3 ) from that for the DM which is 0{N 4 ), 
namely, if we may take M = O(N ) to obtain accurate 
results, we have shown that the truncated PEM further 
reduces it to O(N). It is possible to obtain the exact 



Table II. An estimated cpu time of 10, 000 MC steps for various 
system sizes of the DE model in three dimensions, at T/W = 
0.02 and p/W = 0. For the PEM we have M = 16, while e tr = 
10 -3 and € P = 10 -5 are chosen for the truncation thresholds. 
The cpu time is estimated on a personal computer with a single 
processor of the Athron MP1500+ processor. 



System size 


Diagonalization 


Full PEM 


Truncated PEM 


8x8x8 


2.3 years 


82 days 


2.4 days 


12 x 12 X 12 


300 years 


8.7 years 


8 days 


16 x 16 x 16 


9500 years 


120 years 


21 days 



results within the PEM scheme if we take e p = e tr = 0. 
The computational complexity can further be reduced by 
introducing non-zero values for e p and e tr . The trunca- 
tion errors can be made arbitrary small by taking small 
enough values for e p and et r - 

So far we have restricted ourselves to systems with 
sparse Hamiltonian matrices as well as short-range inter- 
actions between adiabatical fields and electrons. When 
interactions are long ranged while the Hamiltonian ma- 
trices are still sparse, the PEM is applied with less effi- 
ciencies. A local update of the adiabatical fields modu- 
lates large numbers of Hamiltonian matrices, and there- 
fore the number of indices in V* ot may be proportional 
to iVdim in the procedure of the truncation of the trace 
operation. The truncation of the matrix-vector multi- 
plication works similarly as before. Then, the computa- 
tional complexity for a local update of the fields scales 
as 0(M% +1 N) instead of 0{M d+1 ). Similar results will 
be obtained if one performs a global update of the adi- 
abatical fields where number of the fields to be updated 
is O(N). 

However, in the case of systems with dense matrices, 
the PEM is completely inefficient. Namely, a multiplica- 
tion of the Hamiltonian matrix make the vector to prop- 
agate to the whole space, so that procedures to restrict 
the subspace for calculations do not reduce computa- 
tional complexities. Moreover, the matrix-vector mul- 
tiplications cost 0(Nl im ) instead of O(iVdim)- The full 
PEM as well as the truncated PEM gives 0(MN 4 ) com- 
putational complexities, and in this case, the DM should 
simply be used for calculations. 

If we consider a case where calculations are performed 
on small size lattices, the propagation of the vectors 
quickly spreads to the whole space. In other words, the 
Hamiltonian matrices effectively become dense. In this 
case, the truncated PEM does not reduce the compu- 
tational complexities in practice. In general, for each 
system there exists a minimum number for system sizes 
where the truncated PEM is avdantageous compared to 
the DM. The minimum number depends on the range 
of the vector propagations with M-steps, determined by 
model parameters as well as lattice dimensions and ge- 
ometries. 

Near the critical points, improved MC sampling tech- 
niques such as the histogram method or the multicanon- 
ical method are used to overcome the limitations of the 
importance sampling MC method. 11 ) These techniques 
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can also be applied to the truncated PEM, where the 
energy of a sample is given by the "effective electronic 
free energy" defined by 

F off (0) = S* cff {</>)/ 0. (44) 

Let us finally comment on the fact that M may be 
kept to a constant which means that the propagations of 
electrons which contribute to S e g are limited to a finite 
distance R a (M). The situation is valid even at a criti- 
cal point where correlation length for the classical fields 
diverges, since S e g are determined not by correlations 
but by the effective interaction energies among classical 
fields which may be short ranged. From the other point 
of view, actions are in general non-singular at critical 
points so that the polynomial expansion converges sta- 
bly. 
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